Abstract
Background: Artificial Intelligence (AI) has shown great potential for enhancing clinical practice. Many medical societies, including the American Society of Clinical Oncology, have developed AI-powered platforms to answer guideline-based questions. However, these platforms face two major limitations: (1) when users ask questions not covered by existing guidelines, the AI may generate fabricated answers to satisfy the user; and (2) the decision process behind the AI's response is often opaque, making it difficult for users to verify the answer. To address these issues, we developed a multi-agent AI system specifically designed for the American Society of Hematology (ASH) guidelines. This system ensures both transparency and fidelity to the guideline documents.
Methods: Our system incorporates all 22 published ASH guidelines, covering topics such as sickle cell disease, venous thromboembolism, acute myeloid leukemia in older adults, immune thrombocytopenia, erythropoiesis-stimulating agents (ESAs), and von Willebrand disease. The multi-agent AI system operates through a three-stage process. First, a “guideline agent” identifies the most relevant guideline based on the user's input. Next, a “clarification agent” examines the chosen guideline and asks the user for additional information to align the question with scenarios explicitly addressed in the document. Once sufficient clarification is obtained, an “answer agent” formulates a response directly grounded in the guideline text, along with an explanation of the reasoning and source location. To evaluate the system's performance, we constructed a 50-question dataset comprising both vague and subtly misleading questions. Some of these questions were topically aligned with the guidelines but described clinical scenarios not explicitly addressed by the existing recommendations. The platform is publicly accessible at https://ash-guidelines-agent.web.app/.
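The abstract does not specify the implementation, but a minimal sketch of the three-stage orchestration might look like the following. All function names, prompts, and the `call_llm` stub are illustrative assumptions, not the deployed code.

```python
# Minimal sketch of the three-stage pipeline described above. All names,
# prompts, and the call_llm stub are illustrative assumptions, not the
# deployed implementation.
from dataclasses import dataclass, field

@dataclass
class Session:
    question: str
    clarifications: list[tuple[str, str]] = field(default_factory=list)

def call_llm(prompt: str) -> str:
    """Placeholder for the underlying language-model call."""
    raise NotImplementedError("connect a real LLM client here")

def guideline_agent(question: str, titles: list[str]) -> str:
    """Stage 1: select the single most relevant ASH guideline."""
    return call_llm(
        "Pick the one guideline most relevant to the question.\n"
        f"Guidelines: {titles}\nQuestion: {question}"
    )

def clarification_agent(session: Session, guideline: str) -> str | None:
    """Stage 2: ask for missing details, or signal that the question
    already matches a scenario the guideline explicitly addresses."""
    reply = call_llm(
        "Either ask ONE clarifying question needed to map the user's "
        "question onto an explicitly covered scenario, or reply "
        "'SUFFICIENT'.\n"
        f"Guideline: {guideline}\nQuestion: {session.question}\n"
        f"Clarifications so far: {session.clarifications}"
    )
    return None if reply.strip() == "SUFFICIENT" else reply

def answer_agent(session: Session, guideline: str) -> str:
    """Stage 3: answer strictly from the guideline text, citing the
    source location; refuse when the scenario is not covered."""
    return call_llm(
        "Answer ONLY from the guideline text below. Explain your "
        "reasoning and cite the section used; refuse if the scenario "
        "is not explicitly addressed.\n"
        f"Guideline: {guideline}\nQuestion: {session.question}\n"
        f"Clarifications: {session.clarifications}"
    )

def run(question: str, guidelines: dict[str, str], ask_user) -> str:
    """Orchestrate the three agents end to end."""
    session = Session(question)
    text = guidelines[guideline_agent(question, list(guidelines))]
    while (q := clarification_agent(session, text)) is not None:
        session.clarifications.append((q, ask_user(q)))
    return answer_agent(session, text)
```

The key design point this sketch illustrates is that clarification happens in a loop before any answer is generated, so the answer agent only ever sees questions that map onto scenarios the guideline explicitly addresses.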
Results: The multi-agent system consistently demonstrated appropriate behavior by requesting clarifying information before answering and by providing answers along with a transparent reasoning process. For example, when asked whether ESAs are indicated for patients with transfusion-dependent myelodysplastic syndromes (MDS), the system responded with the follow-up question: “Is the patient diagnosed with lower-risk MDS, and what is the patient's serum erythropoietin (EPO) level?” If the user replied that the patient had lower-risk MDS and the EPO level had not been checked, the system answered, “ESAs may be offered to patients with lower-risk MDS if their serum EPO level is less than 500 IU/L,” while also presenting its reasoning and referencing the relevant guideline section.
To evaluate the system's performance, we compared it with two alternative approaches: (1) a direct-answer model that responded using the guideline documents but without the multi-agent framework, functionally similar to asking a question in ChatGPT with the documents uploaded; and (2) a late-clarification model that asked for clarification only after returning an initial answer. When tested on the 50-question dataset, the multi-agent system correctly identified the relevant guideline document in 100% of cases. Among the 36 questions answerable within the scope of the guidelines, the multi-agent system achieved 100% accuracy, compared with 86% for the direct-answer model and 89% for the late-clarification model. Among the 14 questions unanswerable under the existing guidelines, the multi-agent system correctly refused to provide an answer in 86% of cases, significantly higher than the refusal rates of 29% and 43% for the direct-answer and late-clarification approaches, respectively. In the two cases where the multi-agent system answered a question not covered by the guideline content, its reasoning process clearly revealed the errors, allowing easy identification upon review.
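For readers who want to verify the arithmetic, the sketch below reconstructs the reported rates from the underlying counts. The counts are inferred from the stated percentages and denominators (36 answerable, 14 unanswerable questions), not taken from a published table.

```python
# Reported rates reconstructed from counts inferred from this abstract
# (36 answerable and 14 unanswerable questions); illustrative only.
results = {
    # system: (correct on answerable, refused on unanswerable)
    "multi-agent":        (36, 12),
    "direct answer":      (31, 4),
    "late clarification": (32, 6),
}

for name, (correct, refused) in results.items():
    print(f"{name:>18}: accuracy {correct / 36:.0%}, "
          f"refusal rate {refused / 14:.0%}")
```

Running this prints 100%/86% for the multi-agent system, 86%/29% for direct answer, and 89%/43% for late clarification, matching the figures above.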
Conclusion: We developed a multi-agent AI system that improves guideline adherence and safety by requesting clarifications, providing transparent reasoning, and declining to answer when guidelines are not applicable. The system demonstrated excellent accuracy and provided a more reliable and interpretable approach to AI-assisted clinical decision-making.